NSF PAR Search | NSF Public Access Repository

DyG-DPCD: A Distributed Parallel Community Detection Algorithm for Large-Scale Dynamic Graphs

https://doi.org/10.1007/s10766-024-00780-1

Sattar, Naw Safrin; Ibrahim, Khaled Z; Buluc, Aydin; Arifuzzaman, Shaikh (February 2025, International Journal of Parallel Programming)

Fast multiplication of random dense matrices with sparse matrices

Liang, Tianyu; Murray, Riley; Buluc, Aydin; Demmel, James (May 2024, IEEE International Parallel & Distributed Processing Symposium 2024)

This work focuses on accelerating the multiplication of a dense random matrix with a (fixed) sparse matrix, which is frequently used in sketching algorithms. We develop a novel scheme that takes advantage of blocking and recomputation (on-the-fly random number generation) to accelerate this operation. The techniques we propose decrease memory movement, thereby increasing the algorithm’s parallel scalability in shared memory architectures. On the Intel Frontera architecture, our algorithm can achieve 2x speedups over libraries such as Eigen and Intel MKL on some examples. In addition, with 32 threads, we can obtain a parallel efficiency of up to approximately 45%. We also present a theoretical analysis for the memory movement lower bound of our algorithm, showing that under mild assumptions, it is possible to beat the data movement lower bound of general matrix-matrix multiply (GEMM) by a factor of $\sqrt{M}$, where $M$ is the cache size. Finally, we incorporate our sketching method into a randomized algorithm for overdetermined least squares with sparse data matrices. Our results are competitive with SuiteSparse for highly overdetermined problems; in some cases, we obtain a speedup of 10x over SuiteSparse.
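The recomputation idea in this abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it omits blocking and parallelism entirely, and the names `sketch_column` and `sketch_sparse`, the coordinate-triple input format, and the uniform entry distribution are all assumptions made for illustration. The point it shows is that the random matrix S is never stored; each column is regenerated from a seed whenever a nonzero of the sparse matrix needs it.

```python
import random

def sketch_column(seed, k, d):
    """Regenerate column k of a d-row random matrix S from (seed, k).

    Hypothetical helper: the seeding scheme and the uniform(-1, 1)
    distribution are illustrative choices, not taken from the paper.
    """
    rng = random.Random(seed * 1_000_003 + k)
    return [rng.uniform(-1.0, 1.0) for _ in range(d)]

def sketch_sparse(triples, n_cols, d, seed=0):
    """Compute S @ A without ever materializing S.

    triples: iterable of (row, col, value) nonzeros of the sparse matrix A.
    Returns the d x n_cols product as a list of rows.
    """
    out = [[0.0] * n_cols for _ in range(d)]
    for k, j, v in triples:
        # Recompute the needed column of S instead of loading it from memory.
        s_col = sketch_column(seed, k, d)
        for i in range(d):
            out[i][j] += v * s_col[i]
    return out
```

Regenerating a column costs extra arithmetic but avoids reading a stored dense matrix, which is the trade the abstract describes as decreasing memory movement.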
Distributed-Memory Randomized Algorithms for Sparse Tensor CP Decomposition

https://doi.org/10.1145/3626183.3659980

Bharadwaj, Vivek; Malik, Osman Asif; Murray, Riley; Buluc, Aydin; Demmel, James (June 2024, Annual ACM Symposium on Parallelism in Algorithms and Architectures)

Candecomp / PARAFAC (CP) decomposition, a generalization of the matrix singular value decomposition to higher-dimensional tensors, is a popular tool for analyzing multidimensional sparse data. On tensors with billions of nonzero entries, computing a CP decomposition is a computationally intensive task. We propose the first distributed-memory implementations of two randomized CP decomposition algorithms, CP-ARLS-LEV and STS-CP, that offer nearly an order-of-magnitude speedup at high decomposition ranks over well-tuned non-randomized decomposition packages. Both algorithms rely on leverage score sampling and enjoy strong theoretical guarantees, each with varying time and accuracy tradeoffs. We tailor the communication schedule for our random sampling algorithms, eliminating expensive reduction collectives and forcing communication costs to scale with the random sample count. Finally, we optimize the local storage format for our methods, switching between analogues of compressed sparse column and compressed sparse row formats. Experiments show that our methods are fast and scalable, producing 11x speedup over SPLATT by decomposing the billion-scale Reddit tensor on 512 CPU cores in under two minutes.
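For readers unfamiliar with the CP model itself, the following minimal sketch evaluates a rank-R CP model at a single tensor index, where each entry is a sum over R components of products of factor-matrix entries. The function names and the dictionary-of-nonzeros tensor format are assumptions for illustration; the sketch says nothing about the randomized or distributed algorithms the abstract describes.

```python
def cp_entry(factors, index):
    """Evaluate a CP model at one index.

    factors: one factor matrix per tensor mode, each a list of rows of
             length R (the decomposition rank).
    index:   tuple with one coordinate per mode.
    """
    rank = len(factors[0][0])
    total = 0.0
    for r in range(rank):
        term = 1.0
        for mode, i in enumerate(index):
            term *= factors[mode][i][r]
        total += term
    return total

def squared_error_at_nonzeros(nonzeros, factors):
    """Squared model error restricted to stored nonzeros.

    nonzeros: dict mapping index tuples to values (hypothetical format).
    Note: a full fit measure would also account for the zero entries.
    """
    return sum((v - cp_entry(factors, idx)) ** 2 for idx, v in nonzeros.items())
```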
Exploring temporal community evolution: algorithmic approaches and parallel optimization for dynamic community detection

https://doi.org/10.1007/s41109-023-00592-1

Sattar, Naw Safrin; Buluc, Aydin; Ibrahim, Khaled Z.; Arifuzzaman, Shaikh (December 2023, Applied Network Science)

Dynamic (temporal) graphs are a convenient mathematical abstraction for many practical complex systems including social contacts, business transactions, and computer communications. Community discovery is an extensively used graph analysis kernel with rich literature for static graphs. However, community discovery in a dynamic setting is challenging for two specific reasons. Firstly, the notion of temporal community lacks a widely accepted formalization, and only limited work exists on understanding how communities emerge over time. Secondly, the added temporal dimension along with the sheer size of modern graph data necessitates new scalable algorithms. In this paper, we investigate how communities evolve over time based on several graph metrics under a temporal formalization. We compare six different algorithmic approaches for dynamic community detection for their quality and runtime. We identify that a vertex-centric (local) optimization method works as efficiently as the classical modularity-based methods. To its advantage, such local computation allows for the efficient design of parallel algorithms without incurring a significant parallel overhead. Based on this insight, we design a shared-memory parallel algorithm, DyComPar, which demonstrates between 4- and 18-fold speedup on a multi-core machine with 20 threads, for several real-world and synthetic graphs from different domains.
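To make "vertex-centric (local) optimization" concrete, here is a minimal sequential sketch in which each vertex looks only at its neighbors' communities and moves to the one with the best modularity gain. The abstract does not say which local method the authors use, so this Louvain-style local-moving pass is a stand-in, and `local_moving` is a hypothetical name. It handles a single static, unweighted snapshot and is not DyComPar.

```python
from collections import defaultdict

def local_moving(adj, max_rounds=20):
    """Greedy per-vertex community moves on an undirected, unweighted graph.

    adj: dict mapping each vertex to a list of neighbors (no self-loops).
    Returns a dict mapping each vertex to a community label.
    """
    two_m = sum(len(nbrs) for nbrs in adj.values())  # twice the edge count
    degree = {v: len(nbrs) for v, nbrs in adj.items()}
    label = {v: v for v in adj}   # every vertex starts in its own community
    total = dict(degree)          # total degree of each community
    for _ in range(max_rounds):
        moved = False
        for v, nbrs in adj.items():
            old = label[v]
            total[old] -= degree[v]          # take v out of its community
            links = defaultdict(int)         # edges from v into each community
            for u in nbrs:
                links[label[u]] += 1
            # Gain of joining community c, up to a constant factor:
            # links[c] - total[c] * degree[v] / (2m)
            best = old
            best_gain = links[old] - total[old] * degree[v] / two_m
            for c, k_in in links.items():
                gain = k_in - total[c] * degree[v] / two_m
                if gain > best_gain:
                    best, best_gain = c, gain
            label[v] = best
            total[best] += degree[v]
            moved = moved or best != old
        if not moved:
            break
    return label
```

Each vertex update reads only neighbor labels and per-community totals, which is the kind of locality the abstract credits for making parallel designs efficient.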
Fast Exact Leverage Score Sampling from Khatri-Rao Products with Applications to Tensor Decomposition

Bharadwaj, Vivek; Malik, Osman Asif; Murray, Riley; Grigori, Laura; Buluc, Aydin; Demmel, James (December 2023, Neural Information Processing Systems 2023)

We present a data structure to randomly sample rows from the Khatri-Rao product of several matrices according to the exact distribution of its leverage scores. Our proposed sampler draws each row in time logarithmic in the height of the Khatri-Rao product and quadratic in its column count, with persistent space overhead at most the size of the input matrices. As a result, it tractably draws samples even when the matrices forming the Khatri-Rao product have tens of millions of rows each. When used to sketch the linear least squares problems arising in CANDECOMP / PARAFAC tensor decomposition, our method achieves lower asymptotic complexity per solve than recent state-of-the-art methods. Experiments on billion-scale sparse tensors validate our claims, with our algorithm achieving higher accuracy than competing methods as the decomposition rank grows.
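As background for what is being sampled, the sketch below computes exact leverage scores of a Khatri-Rao product the naive way, by enumerating every row, and then draws rows in proportion to them. This is the baseline that the described data structure is designed to avoid, not the authors' sampler. It relies on the standard identity that the Gram matrix of a Khatri-Rao product is the elementwise product of the factors' Gram matrices. The function names are hypothetical and the code is restricted to two factors with two columns each so the Gram matrix can be inverted by hand.

```python
import random

def gram(M):
    """Return M^T M for a matrix given as a list of rows."""
    r = len(M[0])
    return [[sum(row[p] * row[q] for row in M) for q in range(r)]
            for p in range(r)]

def khatri_rao_leverage_scores(A, B):
    """Leverage scores of all rows of the Khatri-Rao product of A and B.

    Assumes both inputs have exactly 2 columns and the product has full
    column rank. Cost grows with len(A) * len(B), which is what makes the
    naive approach impractical at scale.
    """
    GA, GB = gram(A), gram(B)
    g = [[GA[p][q] * GB[p][q] for q in range(2)] for p in range(2)]
    det = g[0][0] * g[1][1] - g[0][1] * g[1][0]
    inv = [[g[1][1] / det, -g[0][1] / det],
           [-g[1][0] / det, g[0][0] / det]]
    scores = {}
    for i, a in enumerate(A):
        for j, b in enumerate(B):
            k = [a[0] * b[0], a[1] * b[1]]  # row (i, j) of the product
            scores[(i, j)] = sum(k[p] * inv[p][q] * k[q]
                                 for p in range(2) for q in range(2))
    return scores

def sample_rows(scores, count, seed=0):
    """Draw row indices with probability proportional to leverage score."""
    rng = random.Random(seed)
    keys = list(scores)
    return rng.choices(keys, weights=[scores[k] for k in keys], k=count)
```

For a full-column-rank matrix the leverage scores sum to the number of columns, which gives a quick sanity check on any implementation.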
Atos: A Task-Parallel GPU Scheduler for Graph Analytics

https://doi.org/10.1145/3545008.3545056

Chen, Yuxin; Brock, Benjamin; Porumbescu, Serban; Buluc, Aydin; Yelick, Katherine; Owens, John (August 2022, Proceedings of the 51st International Conference on Parallel Processing)

We present Atos, a task-parallel GPU dynamic scheduling framework that is especially suited to dynamic irregular applications. Compared to the dominant Bulk Synchronous Parallel (BSP) frameworks, Atos exposes additional concurrency by supporting task-parallel formulations of applications with relaxed dependencies, achieving higher GPU utilization, which is particularly significant for problems with concurrency bottlenecks. Atos also offers implicit task-parallel load balancing in addition to data-parallel load balancing, providing users the flexibility to balance between them to achieve optimal performance. Finally, Atos allows users to adapt to different use cases by controlling the kernel strategy and task-parallel granularity. We demonstrate that each of these controls is important in practice. We evaluate and analyze the performance of Atos vs. BSP on three applications: breadth-first search, PageRank, and graph coloring. Atos implementations achieve geomean speedups of 3.44x, 2.1x, and 2.77x and peak speedups of 12.8x, 3.2x, and 9.08x across three case studies, compared to a state-of-the-art BSP GPU implementation. Beyond simply quantifying the speedup, we extensively analyze the reasons behind each speedup. This deeper understanding allows us to derive general guidelines for how to select the optimal Atos configuration for different applications. Finally, our analysis provides insights for future dynamic scheduling framework designs.
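The contrast between BSP and task-parallel formulations with relaxed dependencies can be sketched on breadth-first search, one of the three applications named above. The code below is a sequential CPU analogy only: it does not use or imitate the Atos API, and both function names are hypothetical. The BSP version advances one frontier at a time with an implicit barrier between levels; the relaxed version processes tasks in arbitrary order and corrects distances whenever a shorter path turns up.

```python
def bfs_bsp(adj, source):
    """Level-synchronous BFS: finish a whole frontier before the next begins."""
    dist = {source: 0}
    frontier = [source]
    while frontier:
        next_frontier = []
        for v in frontier:
            for u in adj[v]:
                if u not in dist:
                    dist[u] = dist[v] + 1
                    next_frontier.append(u)
        frontier = next_frontier  # implicit barrier between levels
    return dist

def bfs_relaxed(adj, source):
    """Barrier-free BFS: tasks run in any order and may be redone.

    A LIFO worklist is used on purpose so vertices are visited out of
    level order; a vertex is re-enqueued whenever its distance improves.
    """
    dist = {source: 0}
    work = [source]
    while work:
        v = work.pop()
        for u in adj[v]:
            if dist[v] + 1 < dist.get(u, float("inf")):
                dist[u] = dist[v] + 1
                work.append(u)
    return dist
```

The relaxed version may do redundant work, but no task ever waits on a level barrier, which is the source of the extra concurrency the abstract refers to.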
GraphBLAS: C++ Iterators for Sparse Matrices

https://doi.org/10.1109/IPDPSW55747.2022.00053

Brock, Benjamin; McMillan, Scott; Buluc, Aydin; Mattson, Timothy G.; Moreira, Jose E. (May 2022, IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW))

Scaling Generalized N-Body Problems, A Case Study from Genomics

https://doi.org/10.1145/3472456.3472517

Ellis, Marquita; Buluc, Aydin; Yelick, Katherine (August 2021, ICPP 2021: 50th International Conference on Parallel Processing)
Accelerating large scale de novo metagenome assembly using GPUs

https://doi.org/10.1145/3458817.3476212

Awan, Muaaz Gul; Hofmeyr, Steven; Egan, Rob; Ding, Nan; Buluc, Aydin; Deslippe, Jack; Oliker, Leonid; Yelick, Katherine (November 2021, The International Conference for High Performance Computing, Networking, Storage and Analysis (SC ’21))

Distributed-Memory k-mer Counting on GPUs

https://doi.org/10.1109/IPDPS49936.2021.00061

Nisa, Israt; Pandey, Prashant; Ellis, Marquita; Oliker, Leonid; Buluc, Aydin; Yelick, Katherine (May 2021, International Parallel and Distributed Processing Symposium)